(54) Object-oriented document change detection

(57) A method for detecting a change in a structured 
document, the method including a) retrieving a structured
document at a first time, b) parsing the structured
document into a first object hierarchy, c) indicating a first 
portion of the structured document to be monitored for 
changes, d) identifying a first object in the first object 
hierarchy that includes the first portion, e) storing a path 
of the first object in the first object hierarchy, f) retrieving 
the structured document at a second time, g) parsing 
the structured document into a second object hierarchy, 
h) locating in the second object hierarchy a second ob- 
ject corresponding to the first object, i) locating in the 
second object a second portion corresponding to the 
first portion, j) comparing the first and second portions, 
and k) providing a change notification if the first and sec- 
ond portions are not identical. 
Description

FIELD OF THE INVENTION 

[0001] The present invention relates to document 
change detection and notification. 

BACKGROUND OF THE INVENTION 

[0002] Tools for detecting changes in documents are 
well known in the art. The Microsoft DOS "FC" command 
compares two or more files and displays the differences 
between them. However, the FC command does not 
provide unattended, periodic document change detec- 
tion, nor does it allow users to specify specific portions 
of a document for which change detection is desired. 
[0003] U.S. Patent No. 5,898,836 to Freivald, et al. 
describes a change detection tool in which a user reg- 
isters a Hypertext Markup Language (HTML) based 
"web page" by submitting his electronic mail (e-mail) ad- 
dress and the Uniform Resource Locator (URL) of the 
desired document to a network server. The server fetch- 
es the document, and the user selects text on the web 
page for which change detection is desired. The docu- 
ment is then divided into sections bounded by HTML 
tags, and a checksum is generated and stored for the 
user-selected section. During periodic comparisons a 
fresh copy of the document is retrieved and again divid- 
ed into sections bounded by HTML tags for which check- 
sums are generated. The freshly-generated checksums 
are compared to the previously-stored checksums, 
identifying changed sections as those having non-matching
checksums. A changed checksum inside the user-selected
section generates a change notification.
Re-ordering of sections, as well as format and layout 
changes, do not generate a change notification when 
the checksums otherwise match. However, the finer the target
text granularity (i.e., the smaller the HTML section), the
greater the risk that a duplicate section with an identical
checksum exists, an ambiguity for which no solution is provided.
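The duplicate-section problem of the checksum approach can be illustrated with a short Python sketch. This is not the implementation of the cited patent: the tag-splitting regular expression, the SHA-256 checksum, and the sample markup are all assumptions chosen for illustration.

```python
import hashlib
import re


def section_checksums(html: str) -> list[str]:
    """Split the markup at HTML tags and checksum each non-empty text section."""
    sections = (part.strip() for part in re.split(r"<[^>]+>", html))
    return [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sections if s]


# Hypothetical first and second retrievals of the same page.
first = "<tr><td>CISC</td><td>65</td></tr><tr><td>MSFT</td><td>65</td></tr>"
second = "<tr><td>CISC</td><td>65</td></tr><tr><td>MSFT</td><td>70</td></tr>"

# The user selected the MSFT price; only the checksum of "65" is stored.
stored = section_checksums(first)[3]

# Matching ignores section order, so the stored checksum is looked up
# anywhere in the freshly generated set of checksums.
fresh = section_checksums(second)
change_detected = stored not in fresh

print(change_detected)
```

In this sketch the MSFT price changes from 65 to 70, yet no change is reported, because the unchanged CISC cell produces the same checksum as the stored section.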
[0004] A method for document change detection is 
therefore needed that overcomes the disadvantages of 
the prior art by allowing unattended, periodic document 
change detection of a user-specified document portion 
while providing greater accuracy at high target text gran- 
ularity. 

SUMMARY OF THE INVENTION 

[0005] There is thus provided in accordance with one embodiment
of the present invention a method for detecting a change in a
structured document, the method including a) retrieving a
structured document at a first time, b) parsing the structured
document into a first object hierarchy, c) indicating a first
portion of the structured document to be monitored for changes,
d) identifying a first object in the first object hierarchy that
includes the first portion, e) storing a path of the first object
in the first object hierarchy, f) retrieving the structured
document at a second time, g) parsing the structured document
into a second object hierarchy, h) locating in the second object
hierarchy a second object corresponding to the first object,
i) locating in the second object a second portion corresponding
to the first portion, j) comparing the first and second portions,
and k) providing a change notification if the first and second
portions are not identical.

[0006] In a further aspect of the present invention the parsing
step b) includes presenting at least the first portion of the
structured document as a hypertext link, and indexing the
hypertext link to the first object, and the indicating step c)
includes selecting the hypertext link.
[0007] In a further aspect of the present invention the
identifying step d) includes identifying at least one heading
associated with the first portion.
[0008] In a further aspect of the present invention the
identifying at least one heading step includes identifying at
least one common property of a plurality of non-heading portions
of the first object, comparing the common property with a
corresponding candidate heading portion of the first object, and
identifying the heading portion as a heading where the common
property of the non-heading candidate portions differs from the
corresponding candidate heading portion.
[0009] In a further aspect of the present invention the locating
step h) includes identifying a heading of the second object that
matches the heading of the first portion.
[0010] In a further aspect of the present invention the candidate
heading portion is either of a row and a column at an edge of the
first object.
[0011] In a further aspect of the present invention the
identifying at least one heading step includes identifying at
least one property of an object neighboring the first object that
is absent from the first object.
[0012] In a further aspect of the present invention the locating
step h) includes identifying a heading of the second object that
matches the heading of the neighboring object.

[0013] The disclosures of all patents, patent applications, and
other publications mentioned in this specification and of the
patents, patent applications, and other publications cited
therein are hereby incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0014] The present invention will be understood and appreciated
more fully from the following detailed description taken in
conjunction with the appended drawings in which:



Fig. 1 is a simplified flowchart illustration of a method of
preparation for document change detection, operative in
accordance with a preferred embodiment of the present invention;
Figs. 2A, 2B, and 2C are, respectively, an exemplary document
portion, its HTML source-code representation, and its object
hierarchy, and are useful in understanding the method of Fig. 1;
Fig. 3 is a simplified flowchart illustration of a method of
document change detection, operative in accordance with a
preferred embodiment of the present invention;
Fig. 4 is a simplified flowchart illustration of a method of text
indication, operative in accordance with a preferred embodiment
of the present invention;
Fig. 5 is a simplified flowchart illustration of a method of
preparation for document change detection, operative in
accordance with a preferred embodiment of the present invention;
Figs. 6A, 6B, and 6C are, respectively, an exemplary document
portion, its HTML source-code representation, and its object
hierarchy, and are useful in understanding the method of Fig. 5;
and
Fig. 7 is a simplified flowchart illustration of a method of
document change detection, operative in accordance with a
preferred embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

[0015] Reference is now made to Fig. 1, which is a simplified
flowchart illustration of a method of preparation for document
change detection, operative in accordance with a preferred
embodiment of the present invention, and additionally to Figs.
2A, 2B, and 2C which are, respectively, an exemplary document
portion, its HTML source-code representation, and its object
hierarchy, and which are useful in understanding the method of
Fig. 1. In the method of Fig. 1 a structured document, such as
the HTML document seen in Figs. 2A and 2B, is selected by a user
(step 100). The document is then parsed using conventional
techniques and expressed as a hierarchy of nested document
objects (step 110). For example, the hierarchy of document
objects shown in Fig. 2A may be expressed as the object hierarchy
shown in Fig. 2C where an HTML object is at the top of the
hierarchy (lines 1 - 35) within which is a HEAD object (lines
2 - 4) and a BODY object (lines 5 - 34). The BODY object includes
a TABLE object (lines 10 - 31). The TABLE object in turn includes
first, second, and third ROW objects (at lines 12, 18, and 24).
ROW object 1 in turn includes first, second, third, and fourth
COLUMN objects (lines 13 - 16), and so on.
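The parsing of step 110 can be sketched in Python with the standard-library `html.parser` module. The `Node` and `HierarchyBuilder` names and the simplified, well-formed sample markup are assumptions made for illustration; the sketch keeps the HTML tag names TR and TD where the description speaks of ROW and COLUMN objects.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


class HierarchyBuilder(HTMLParser):
    """Builds a tree of nested document objects from HTML markup.

    Simplification: void elements such as <br> are not special-cased.
    """

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("DOCUMENT")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper())
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this tag, if any.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag.upper():
                del self._stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].text += data.strip()


def parse(markup: str) -> Node:
    builder = HierarchyBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


SAMPLE = (
    "<html><head><title>Stock Prices</title></head><body><table>"
    "<tr><td>Stock</td><td>Close</td></tr>"
    "<tr><td>MSFT</td><td>106</td></tr>"
    "</table></body></html>"
)

tree = parse(SAMPLE)
html_obj = tree.children[0]
table = html_obj.children[1].children[0]
```

Here `table.children` holds the two TR objects, and `table.children[1].children[1].text` holds the MSFT closing price.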
[0016] The user then indicates the text within the document upon
which change detection is to be provided (step 120). The text
indication may be done using conventional techniques, such as by
highlighting the text using a pointing device or using other
known means. The document object or objects that contain the
indicated text are then identified (step 130). Where two document
objects include the same indicated text, such as where the same
text appears more than once in a single web page, the user may be
prompted to select a single "target" object that contains the
selected text, such as by displaying the indicated text together
with surrounding text, or using any other known prompting
technique (step 140). An alternative duplicate text resolution
technique is described hereinbelow with reference to Fig. 4. If
the target object is a table, then heading resolution is
preferably performed (steps 150 - 180). If the target object is
not a table, then the indicated text is stored together with the
hierarchical "path" of its object within the document (step 190).
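Steps 130 and 190 can be sketched as a search that returns the hierarchical path of every object containing the indicated text. The `Node` type, the path encoding (object type plus 1-based position among same-type siblings), and the sample document are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# A path is a sequence of (object type, 1-based position among same-type siblings).
Path = tuple[tuple[str, int], ...]


def find_paths(node: Node, wanted: str, path: Path) -> list[Path]:
    """Return the path of every object whose own text contains `wanted`."""
    found = [path] if wanted in node.text else []
    counts: dict[str, int] = {}
    for child in node.children:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        found += find_paths(child, wanted, path + ((child.tag, counts[child.tag]),))
    return found


def row(name: str, close: str) -> Node:
    return Node("ROW", children=[Node("COLUMN", name), Node("COLUMN", close)])


doc = Node("HTML", children=[Node("BODY", children=[
    Node("TABLE", children=[row("Stock", "Close"), row("CISC", "65"), row("MSFT", "65")]),
])])

candidates = find_paths(doc, "65", (("HTML", 1),))
```

Because "65" appears in two objects, `candidates` holds two paths, which is the situation in which step 140 prompts the user to pick a single target object.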

[0017] If the target object is a table, the ROW and COLUMN
headings of each TABLE object are determined by identifying one
or more common properties of non-heading rows and columns (step
160). A non-heading row or column is typically one that is not
located at an edge of a table, where "edge" as defined herein may
refer to the first or last row or column of a table. Such common
properties may include any of font type, color, and/or bold,
underline, etc., whose values are identical between non-heading
rows and/or columns. The common properties are then compared with
those of the candidate heading rows and columns (step 170). A
candidate heading row or column is typically one that is located
at an edge of the table, typically the top row and/or left column
of the table. Where the properties and/or their values are
different, the labels of the candidate heading are identified as
headings, and are then stored (step 180). For example, in Fig. 2A
the text "Stock," "High," "Low," and "Close" are identified as
column headings, and "MSFT" and "CISC" are identified as row
headings.
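The heading resolution of steps 160 - 180 can be sketched as follows. The `Cell` type, the choice of `bold` and `font` as the compared properties, and the restriction of candidates to the top row and left column are assumptions for illustration; the cell values follow Fig. 2B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    text: str
    bold: bool = False
    font: str = ""


PROPERTIES = ("bold", "font")


def common_properties(cells: list[Cell]) -> dict:
    """Step 160: properties whose values are identical across all given cells."""
    common = {}
    for name in PROPERTIES:
        values = {getattr(cell, name) for cell in cells}
        if len(values) == 1:
            common[name] = values.pop()
    return common


def differs(cells: list[Cell], common: dict) -> bool:
    """Step 170: every candidate cell departs from at least one common value."""
    return all(any(getattr(c, n) != v for n, v in common.items()) for c in cells)


def detect_headings(table: list[list[Cell]]) -> tuple[list[str], list[str]]:
    """Step 180: return (column headings, row headings) found at the table edge."""
    interior = [cell for line in table[1:] for cell in line[1:]]
    common = common_properties(interior)
    top = table[0]
    left = [line[0] for line in table[1:]]
    columns = [c.text for c in top] if common and differs(top, common) else []
    rows = [c.text for c in left] if common and differs(left, common) else []
    return columns, rows


def plain(text: str) -> Cell:
    return Cell(text, bold=False, font="Arial")


def bold(text: str) -> Cell:
    return Cell(text, bold=True)


TABLE = [
    [bold("Stock"), bold("High"), bold("Low"), bold("Close")],
    [bold("CISC"), plain("67"), plain("64"), plain("65")],
    [bold("MSFT"), plain("108"), plain("104"), plain("106")],
]

column_headings, row_headings = detect_headings(TABLE)
```

With this input the interior cells share a non-bold Arial style, the edge cells are bold, and both the top row and the left column are therefore reported as headings.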

[0018] Thus, in the example of Fig. 2A, were the MSFT closing
price selected at reference numeral 10, its path might be
expressed as "The first HTML object, the first TABLE object, the
third ROW object, the fourth COLUMN object", and its associated
headings as ROW HEADING="MSFT" and COLUMN HEADING="Close".

[0019] Reference is now made to Fig. 3, which is a simplified
flowchart illustration of a method of document change detection,
operative in accordance with a preferred embodiment of the
present invention. In the method of Fig. 3 the document
previously selected by the user and processed in accordance with
the method of Fig. 1 is periodically retrieved (step 300). The
document is then parsed in the manner described hereinabove with
reference to steps 110 - 150 of Fig. 1 (step 310). The object
path of the text for which change detection has been requested is
then applied to the newly-retrieved document (step 320). Thus,
continuing with the preceding example, the fourth COLUMN object
of the third ROW object of the first TABLE object of the first
HTML object is retrieved. The stored row and/or column headings
are then compared to those of the newly-retrieved object (step
330). Should they match, the stored text is then compared to the
newly-retrieved text (step 340), with a change notification being
sent to the user where a change is detected (step 350). The
notification may be effected using any conventional messaging
technique, such as via email or Short Message Service (SMS)
notification. Should the row and/or column headings at the
specified row and column not match, the rows and columns of the
object are searched until a matching row and column heading are
found (step 360). Should only a row heading match, but not a
column heading, or vice versa, the corresponding column/row
number may be used. Should no matching row and column headings be
found within the object, or should the object path be
unresolvable (e.g., no third table object at this level of the
object hierarchy) other objects of the same type at the current
level in the object hierarchy are similarly searched for matching
headings (step 370). Increasingly higher levels of the hierarchy
may also be searched in a similar manner (step 380).
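Steps 320 - 360 can be sketched for a single table as follows. The `Watch` record, the 0-based indices, the representation of the table as a grid of strings, and the message format are assumptions for illustration; the wider search of steps 370 - 380 is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watch:
    row: int            # 0-based row index stored at preparation time
    col: int            # 0-based column index stored at preparation time
    row_heading: str
    col_heading: str
    text: str


def locate(table: list[list[str]], watch: Watch) -> Optional[tuple[int, int]]:
    """Return (row, col) of the watched cell in a freshly parsed table."""
    in_range = 0 <= watch.row < len(table) and 0 <= watch.col < len(table[0])
    # Step 330: do the headings at the stored position still match?
    if in_range and table[watch.row][0] == watch.row_heading \
            and table[0][watch.col] == watch.col_heading:
        return watch.row, watch.col
    # Step 360: search the rows and columns for matching headings.
    rows = [r for r, line in enumerate(table) if line and line[0] == watch.row_heading]
    cols = [c for c, head in enumerate(table[0]) if head == watch.col_heading]
    # If only one heading matches, fall back to the stored number for the other.
    r = rows[0] if rows else (watch.row if cols else None)
    c = cols[0] if cols else (watch.col if rows else None)
    if r is None or c is None or r >= len(table) or c >= len(table[r]):
        return None
    return r, c


def check(table: list[list[str]], watch: Watch) -> Optional[str]:
    """Steps 340 - 350: return a notification message if the text changed."""
    position = locate(table, watch)
    if position is None:
        return None  # the caller would go on to search sibling objects
    new_text = table[position[0]][position[1]]
    if new_text == watch.text:
        return None
    return f"{watch.row_heading}/{watch.col_heading} changed: {watch.text} -> {new_text}"


watch = Watch(row=2, col=3, row_heading="MSFT", col_heading="Close", text="106")

reordered = [
    ["Stock", "High", "Low", "Close"],
    ["MSFT", "108", "104", "110"],
    ["CISC", "67", "64", "65"],
]
message = check(reordered, watch)
```

In this usage the MSFT row has moved from the third position to the second, so the stored position fails the heading check and the cell is found again through its headings before its text is compared.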
[0020] Reference is now made to Fig. 4, which is a 
simplified flowchart illustration of a method of text indi- 
cation, operative in accordance with a preferred embod- 
iment of the present invention. In the method of Fig. 4 
the structured document is selected by the user (step 
400). The document is then reformatted such that each 
text element (i.e., alphabetic, numeric, or alphanumeric 
word) is converted to a unique, clickable hypertext link 
(step 410). The links are indexed in relation to the doc- 
ument object which contains the link and, optionally, rel- 
ative to duplicate text links within the object (step 420). 
Thus, the user may indicate text within the document 
simply by clicking on the text link upon which change 
detection is to be provided (step 430). Since the link is 
uniquely indexed, text selection may be carried out un- 
ambiguously where the text is otherwise duplicated 
within the document, and no duplication resolution 
methods are required. 
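The link indexing of steps 410 - 420 can be sketched as follows. The `#select-N` link target, the string form of the object paths, and the word-matching regular expression are assumptions for illustration.

```python
import html
import re


def linkify(objects: list[tuple[str, str]]) -> tuple[str, dict]:
    """Turn every word into a uniquely numbered link.

    `objects` is a list of (object path, text). The returned index maps each
    link number to (object path, word, occurrence of that word in the object).
    """
    index: dict[int, tuple[str, str, int]] = {}
    parts = []
    for path, text in objects:
        seen: dict[str, int] = {}
        for word in re.findall(r"[A-Za-z0-9]+", text):
            seen[word] = seen.get(word, 0) + 1
            link_id = len(index)
            index[link_id] = (path, word, seen[word])
            parts.append(f'<a href="#select-{link_id}">{html.escape(word)}</a>')
    return " ".join(parts), index


# Two objects carrying the same text, as in a table with two equal prices.
markup, index = linkify([
    ("ROW 2 / COLUMN 4", "65"),
    ("ROW 3 / COLUMN 4", "65"),
])
```

Although both cells read "65", each link has its own number, so a click on either one resolves to exactly one object path through `index`.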

[0021] Reference is now made to Fig. 5, which is a simplified
flowchart illustration of a method of preparation for document
change detection, operative in accordance with a preferred
embodiment of the present invention, and additionally to Figs.
6A, 6B, and 6C which are, respectively, an exemplary document
portion, its HTML source-code representation, and its object
hierarchy, and which are useful in understanding the method of
Fig. 5. In the method of Fig. 5 a structured document, such as
the HTML document seen in Figs. 6A and 6B, is selected by a user
(step 500). The document is then parsed using conventional
techniques and expressed as a hierarchy of nested document
objects (step 510). For example, the hierarchy of document
objects shown in Fig. 6A may be expressed as the object hierarchy
shown in Fig. 6C where an HTML object is at the top of the
hierarchy within which is a HEAD object and a BODY object. The
BODY object includes three PARAGRAPH objects (bounded by the tags
<P> and </P>). The second PARAGRAPH object in turn includes two
objects: a BOLD TEXT object (bounded by the tags <B> and </B>),
and a TEXT object following the BOLD TEXT object.



[0022] The user then indicates the text within the document upon
which change detection is to be provided, such as the second TEXT
object within the second PARAGRAPH object in the current example
(step 520). The document object or objects that contain the
indicated text are then identified (step 530). Where two document
objects include the same indicated text, duplicate text
resolution techniques such as those described hereinabove may be
used until a single target object containing the indicated text
is identified (step 540). The closest neighboring object to the
left of the target object or immediately preceding the target
object is then checked for properties that the target object does
not possess, such as bold, font size, italic, etc. One such
property is typically selected to represent a "pseudo-heading" of
the target object (step 550). The path of the target object and
the selected text, as well as the pseudo-heading of the preceding
object, are then stored (step 560).
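The pseudo-heading selection of step 550 can be sketched as follows. The `Obj` type, the representation of formatting as a set of property names, the alphabetical tie-break between several candidate properties, and the decision to store the neighbor's text alongside the property are all assumptions; the description does not fix these details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Obj:
    kind: str
    text: str
    props: frozenset = frozenset()


def pseudo_heading(siblings: list[Obj], target_index: int) -> Optional[tuple[str, str]]:
    """Return (property, neighbor text) for the object preceding the target.

    The property is one that the neighbor has and the target lacks.
    """
    if target_index == 0:
        return None  # no preceding neighbor
    neighbor = siblings[target_index - 1]
    extra = neighbor.props - siblings[target_index].props
    if not extra:
        return None
    return min(extra), neighbor.text  # alphabetical choice, for determinism


# Second PARAGRAPH object of Fig. 6B: <B>Price: </B>$97.00
paragraph = [
    Obj("BOLD TEXT", "Price:", frozenset({"bold"})),
    Obj("TEXT", "$97.00"),
]

heading = pseudo_heading(paragraph, 1)
```

For the price text the preceding object is bold and the target is not, so the stored pseudo-heading in this sketch is the bold "Price:" label.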
[0023] Reference is now made to Fig. 7, which is a simplified
flowchart illustration of a method of document change detection,
operative in accordance with a preferred embodiment of the
present invention. In the method of Fig. 7 the document
previously selected by the user is periodically retrieved and
processed in accordance with the method of Fig. 5 (step 700). The
document is then parsed in the manner described hereinabove with
reference to steps 510 - 550 of Fig. 5 (step 710). The object
path of the target object is then applied to the newly-retrieved
document (step 720). Thus, continuing with the preceding example,
the second TEXT object of the second PARAGRAPH object of the
first BODY object of the first HTML object is retrieved. The
closest neighboring object to the left of the target object or
immediately preceding the target object is then checked to see if
it matches the stored pseudo-heading of the target object (step
730). Should they match, the stored text of the previously
selected target object is then compared to the newly-retrieved
text of the corresponding object (step 740), with a change
notification being sent to the user where a change is detected
(step 750).
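Steps 720 - 750 can be sketched as follows. The sibling representation, the status strings, and the form of the stored pseudo-heading (a property name plus the neighbor's text) are assumptions for illustration; the description does not say what happens when the pseudo-heading fails to match, so the sketch only reports that outcome.

```python
# Each sibling is (text, properties); properties are formatting flags.
Sibling = tuple[str, frozenset]


def check_target(siblings: list[Sibling], index: int,
                 heading: tuple[str, str], stored_text: str) -> str:
    """Apply the stored position and pseudo-heading to freshly parsed siblings."""
    if not 0 < index < len(siblings):
        return "unresolved"        # step 720: the stored path no longer applies
    neighbor_text, neighbor_props = siblings[index - 1]
    prop, heading_text = heading
    if prop not in neighbor_props or neighbor_text != heading_text:
        return "heading-mismatch"  # step 730 failed
    # Step 740: compare the stored text with the newly retrieved text.
    return "unchanged" if siblings[index][0] == stored_text else "changed"


stored_heading = ("bold", "Price:")

fresh = [("Price:", frozenset({"bold"})), ("$99.50", frozenset())]
status = check_target(fresh, 1, stored_heading, "$97.00")
```

A "changed" status is the point at which the change notification of step 750 would be sent.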

[0024] It is appreciated that one or more of the steps of any of
the methods described herein may be omitted or carried out in a
different order than that shown, without departing from the true
spirit and scope of the invention.

[0025] While the methods and apparatus disclosed herein may or
may not have been described with reference to specific hardware
or software, the methods and apparatus have been described in a
manner sufficient to enable persons of ordinary skill in the art
to readily adapt commercially available hardware and software as
may be needed to reduce any of the embodiments of the present
invention to practice without undue experimentation and using
conventional techniques.

[0026] While the present invention has been described with
reference to a few specific embodiments, the description is
intended to be illustrative of the invention as a whole and is
not to be construed as limiting the invention to the embodiments
shown. It is appreciated that various modifications may occur to
those skilled in the art that, while not specifically shown
herein, are nevertheless within the true spirit and scope of the
invention.

EP 1 164 498 A1



Claims 

1. A method for detecting a change in a structured document, the
method comprising:

a) retrieving a structured document at a first time;
b) parsing said structured document into a first object
hierarchy;
c) indicating a first portion of said structured document to be
monitored for changes;
d) identifying a first object in said first object hierarchy that
includes said first portion;
e) storing a path of said first object in said first object
hierarchy;
f) retrieving said structured document at a second time;
g) parsing said structured document into a second object
hierarchy;
h) locating in said second object hierarchy a second object
corresponding to said first object;
i) locating in said second object a second portion corresponding
to said first portion;
j) comparing said first and second portions; and
k) providing a change notification if said first and second
portions are not identical.

2. A method according to claim 1 wherein said parsing step b)
comprises:

presenting at least said first portion of said structured
document as a hypertext link; and
indexing said hypertext link to said first object,
and wherein said indicating step c) comprises selecting said
hypertext link.

3. A method according to claim 1 wherein said identifying step d)
comprises identifying at least one heading associated with said
first portion.

4. A method according to claim 3 wherein said identifying at
least one heading step comprises:

identifying at least one common property of a plurality of
non-heading portions of said first object;
comparing said common property with a corresponding candidate
heading portion of said first object; and
identifying said heading portion as a heading where said common
property of said non-heading candidate portions differs from said
corresponding candidate heading portion.

5. A method according to claim 3 wherein said locating step h)
comprises identifying a heading of said second object that
matches said heading of said first portion.

6. A method according to claim 4 wherein said candidate heading
portion is either of a row and a column at an edge of said first
object.

7. A method according to claim 3 wherein said identifying at
least one heading step comprises:

identifying at least one property of an object neighboring said
first object that is absent from said first object.

8. A method according to claim 7 wherein said locating step h)
comprises identifying a heading of said second object that
matches said heading of said neighboring object.

Fig. 2A



 1  <HTML>
    <HEAD>
    <TITLE>Stock Prices</TITLE>
    </HEAD>
 5  <BODY>
    <P ALIGN="LEFT">Stock Prices</P>
    <P ALIGN="LEFT"></P>


10  <TABLE BORDER CELLSPACING=2 BORDERCOLOR="#000000" CELLPADDING=7 WIDTH=240>

    <TR>
    <TD WIDTH="30%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">Stock</B></TD>
    <TD WIDTH="23%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">High</B></TD>
15  <TD WIDTH="22%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">Low</B></TD>
    <TD WIDTH="25%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">Close</B></TD>
    </TR>
    <TR>
    <TD WIDTH="30%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">CISC</B></TD>
20  <TD WIDTH="23%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">67</FONT></TD>
    <TD WIDTH="22%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">64</FONT></TD>
    <TD WIDTH="25%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">65</FONT></TD>
    </TR>
    <TR>
25  <TD WIDTH="30%" VALIGN="TOP"><B><P ALIGN="JUSTIFY">MSFT</B></TD>
    <TD WIDTH="23%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">108</FONT></TD>
    <TD WIDTH="22%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">104</FONT></TD>
    <TD WIDTH="25%" VALIGN="TOP"><FONT FACE="Arial"><P ALIGN="JUSTIFY">106</FONT></TD>
    </TR>
30
    </TABLE>

    <P ALIGN="JUSTIFY"></P>
    </BODY>
35  </HTML>



Fig. 2B 
Microsoft MSFT
Price: $97.00 
June 5, 2000 

Fig. 6A 



<HTML>
<HEAD>
</HEAD>
<BODY>

<P ALIGN="LEFT">Microsoft MSFT</P>
<P ALIGN="LEFT"><B>Price: </B>$97.00</P>
<P ALIGN="LEFT">June 5, 2000</P>

</BODY>
</HTML>



Fig. 6B 